Morning Tutorial (AM): 9:00 – 12:15
Hall: Melambus
Description
In our increasingly interconnected world, the ability to communicate and comprehend
information across multiple languages has become crucial. Speech translation technology,
encompassing speech-to-text (S2T) and speech-to-speech translation (S2ST), is
instrumental in bridging these communication gaps and facilitating access to multilingual
content.
A key focus of this tutorial is Meta’s Seamless suite of models, a significant contribution to
the field of speech translation. These models, including SeamlessM4T, SeamlessExpressive,
SeamlessStreaming, and Seamless, have revolutionized translation performance and
enhanced cross-lingual communication.
Participants will gain insights into the workings of SeamlessM4T, a Massively Multilingual &
Multimodal Machine Translation model that provides state-of-the-art translation and
transcription across approximately 100 languages. The tutorial will also cover the
SeamlessExpressive and SeamlessStreaming models, which enable expressive and real-time
translations, respectively.
Finally, we will delve into the Seamless model, which unifies the high-quality, multilingual
capabilities of SeamlessM4T v2, the rapid response of SeamlessStreaming, and the
expressivity preservation of SeamlessExpressive into a single system.
Ms. Sravya Popuri Meta, USA | Sravya is a research engineering lead at Meta working on foundational models for speech translation. Her current research focuses on multilingual and multimodal speech translation. Sravya also served as the tech lead for the SeamlessM4T project, which was recognized in Time’s Best Inventions of 2023. Before joining industry, she received her Master’s degree from the Language Technologies Institute, Carnegie Mellon University. | |
Dr. Hongyu Gong Meta, USA | Hongyu is a research scientist at Meta, and her main research interests include multilingual and multimodal modeling. She serves as Area Chair for ACL Rolling Review, and a regular reviewer of ML/NLP/Speech conferences including NeurIPS, ICML, ACL, Interspeech and ICASSP. She obtained a Ph.D. degree from the University of Illinois at Urbana-Champaign with research focus on language representations and generation. | |
Ms. Anna Sun Meta, USA | Anna is a research engineer at Meta, and her research interests include multilingual low-resource translation, simultaneous translation, and multimodal speech translation. She recently presented on Seamless Communication in the expo at NeurIPS. Prior to this, she worked on content understanding and recommendation systems at Meta. She received her Bachelor’s degree from Duke University. | |
Dr. Kevin Heffernan Meta, USA | Kevin is a research engineer at Meta, and his main research interests include multilingual and multimodal translation, and data augmentation methods. He recently presented at ACL ‘23, and is a regular reviewer for ML/NLP conferences such as ACL, NAACL, and EMNLP. He obtained a Ph.D. degree from the University of Cambridge, with a research focus on computational linguistics. |
Morning Tutorial (AM): 9:00 – 12:15
Hall: Acesso
Description
Neural speech and audio coding is an emerging area where data-driven approaches provide unprecedented success in compressing speech and audio signals. Meanwhile, due to the highly subjective nature of human auditory perception, there remain challenges and opportunities in developing such methods based purely on objective metrics and data. The tutorial will provide a comprehensive background for those who are interested in speech and audio coding as a new application area of their machine learning/deep learning research, as well as for domain experts who want to adopt data-driven technologies. It will cover topics ranging from straightforward end-to-end models to generative models, perceptual loss functions, advanced quantization mechanisms, and hybrid approaches that combine data-driven and traditional methods.
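To make the quantization topic concrete, here is a minimal sketch of residual vector quantization (RVQ), a mechanism used in several neural codecs. The codebooks are random placeholders rather than trained ones, so it only illustrates the encode/decode mechanics.

```python
# A minimal RVQ sketch with random placeholder codebooks; a real codec learns
# the codebooks and the encoder/decoder from data.
import numpy as np

rng = np.random.default_rng(0)
dim, codebook_size, n_stages = 64, 256, 4
codebooks = [rng.standard_normal((codebook_size, dim)) for _ in range(n_stages)]

def rvq_encode(latent, codebooks):
    """Quantize a latent vector stage by stage; each stage encodes the residual."""
    residual, codes = latent.copy(), []
    for cb in codebooks:
        idx = np.argmin(np.sum((cb - residual) ** 2, axis=1))  # nearest code vector
        codes.append(int(idx))
        residual -= cb[idx]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected code vectors from every stage."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

latent = rng.standard_normal(dim)            # stand-in for one encoder frame
codes = rvq_encode(latent, codebooks)        # n_stages integers per frame
recon = rvq_decode(codes, codebooks)
print(codes, float(np.linalg.norm(latent - recon)))
```

With 4 codebooks of 256 entries, each frame costs 4 × 8 = 32 bits, i.e. roughly 3.2 kbit/s at a 100 Hz frame rate, which is the kind of rate/quality trade-off the tutorial examines.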
Prof. Minje Kim University of Illinois at Urbana-Champaign; Amazon Lab126, USA | Minje Kim is an Associate Professor in the Department of Computer Science at the University of Illinois at Urbana-Champaign (2024–present) and a Visiting Academic at Amazon Lab126 (2020–present). Prior to that, he was an Associate Professor at Indiana University (2016–2023). Minje Kim earned his Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign (2016) after having worked as a researcher at ETRI, a national lab in Korea (2006–2011). Minje Kim’s research focus is on developing machine learning models for speech and audio problems. Throughout his career, he has been recognized with various awards, including the NSF Career Award (2021), the Indiana University Trustees Teaching Award (2021), and the IEEE SPS Best Paper Award (2020), among others. He is an IEEE Senior Member, Vice Chair of the IEEE SPS AASP TC, a Senior Area Editor for IEEE/ACM Trans. on Audio, Speech, and Language Processing, an Associate Editor for EURASIP Journal of Audio, Speech, and Music Processing, and a Consulting Associate Editor for IEEE Open Journal of Signal Processing. He holds over 50 patents as an inventor. | |
Dr. Jan Skoglund Google LLC, USA | Jan Skoglund is currently leading a team at Google in San Francisco, CA, which specializes in developing speech and audio signal processing components. His work has made significant contributions to Google’s software products (Meet) and hardware products (Chromebooks). Jan received his Ph.D. degree in 1998 from Chalmers University of Technology in Sweden. His doctoral research was centered around low-bitrate speech coding. After obtaining his Ph.D., he joined AT&T Labs-Research in Florham Park, NJ, where he continued to work on low-bitrate speech coding. In 2000, he moved to Global IP Solutions (GIPS) in San Francisco and worked on speech and audio processing technologies, including compression, enhancement, and echo cancellation, which were particularly tailored for packet-switched networks. GIPS’ audio and video technology was integrated into numerous deployments by prominent companies such as IBM, Google, Yahoo, WebEx, Skype, and Samsung. The technology was later open-sourced as WebRTC after GIPS was acquired by Google in 2011. He is a Senior Member of IEEE and has been actively involved in AASP and SLTC. |
Full-day Tutorial (AM+PM): 9:00 – 17:00
Hall: Homer
Description
Interest in using speech patterns to identify health conditions has significantly increased. The idea is that any neurological, mental, or physical deficits affecting speech production can be objectively assessed via speech analysis. Recent speech-based machine learning (ML) studies for automatic diagnosis, prognosis, and longitudinal tracking of health conditions typically follow the supervised learning paradigm that has succeeded in consumer-oriented speech applications. However, clinical speech analytics faces unique challenges, such as limited data availability, limited speech representations, and uncertain diagnostic labels. These differences mean that ML approaches successful in other contexts may not perform as well when applied in real-world clinical settings. With translation to real-world clinical applications in mind, this tutorial will discuss:
1) designing speech elicitation tasks for different clinical conditions,
2) speech data collection and hardware verification,
3) developing and validating speech representations for clinical measures,
4) developing reliable clinical prediction models (see the brief sketch after this list), and
5) ethical considerations.
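As a brief illustration of point 4, the sketch below evaluates a clinical classifier with speaker-grouped cross-validation so that no speaker appears in both training and test folds. The features and labels are synthetic placeholders; this is only a minimal example of avoiding speaker-identity leakage when data are limited.

```python
# A minimal sketch of speaker-independent evaluation on synthetic placeholder
# data; grouping folds by speaker keeps the model from exploiting speaker
# identity rather than the clinical signal.
import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_speakers, utts_per_speaker, n_feats = 40, 5, 20
X = rng.standard_normal((n_speakers * utts_per_speaker, n_feats))  # acoustic features (placeholder)
speakers = np.repeat(np.arange(n_speakers), utts_per_speaker)      # speaker ID per utterance
y = np.repeat(rng.integers(0, 2, n_speakers), utts_per_speaker)    # one diagnostic label per speaker

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, X, y, groups=speakers,
                         cv=GroupKFold(n_splits=5), scoring="roc_auc")
print(scores.mean(), scores.std())
```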
Prof. Visar Berisha Arizona State University, USA | Visar Berisha is a Professor at Arizona State University, with a joint appointment in the College of Engineering and the College of Health Solutions; and Associate Dean for Research Commercialization in the College of Engineering. With a focus on speech as a biomarker, his main research interests include development of statistical signal processing and machine learning tools for reliably extracting clinically-relevant information from human biosignals. His research is primarily funded by the National Institutes of Health, the Department of Defense, and the National Science Foundation. This work has led to many academic publications, several patents, and a VC-backed company. Berisha’s work has been featured in the New York Times, on ESPN, National Public Radio, the Wall Street Journal, and a number of other international media outlets. He is the 2023-2024 ISCA Distinguished Lecturer. | |
Prof. Julie Liss Arizona State University, USA | Julie Liss, PhD CCC-SLP is a Professor and Associate Dean in the College of Health Solutions. Her research interests include clinical speech and language analytics and neuroscience, with a focus on motor-speech disorders. Her research has been funded by NIH continuously since 1994. She is a fellow of the American Speech-Language-Hearing Association and has served as Editor-in-Chief for the discipline’s flagship Journal of Speech, Language and Hearing Research, and currently serves as the Senior Editor for Registered Reports. | |
Dr. Si-ioi Ng Arizona State University, USA | Si-ioi Ng is a postdoctoral scholar in the College of Health Solutions at Arizona State University. He received the B.Eng. and Ph.D. degrees in Electronic Engineering from The Chinese University of Hong Kong in 2018 and 2023, respectively. His doctoral study combined speech processing, clinical expertise, and machine learning (ML) to classify Cantonese-speaking preschool children with speech sound disorders. He also co-developed a large-scale Cantonese corpus of children’s speech collected from ~2000 children. His recent research focuses on assessing neurodegenerative disorders based on speech biomarkers, with a particular interest in addressing the reliability of ML-based clinical speech analytics. | |
Prof. Ingo Siegert Otto von Guericke University, Germany | Ingo Siegert, PhD, has been Junior Professor for Mobile Dialog Systems at the Otto von Guericke University Magdeburg since 2018. His research interests focus on signal-based analyses and interdisciplinary investigations of (human-) human-computer interaction. A further recent focus is on privacy and security of speech-based interactions. He is the secretary of the ISCA SIG on Security and Privacy in Speech Communication (SPSC). He is very active in the organization of workshops and conferences as well as in science communication. | |
Dr. Nicholas Cummins King’s College London, UK | Dr Nicholas (Nick) Cummins is a Lecturer in AI for speech analysis for health at the Department of Biostatistics and Health Informatics at King’s College London. He is also the Chief Science Officer for Thymia, a start-up developing technologies to analyze facial microexpressions & speech patterns to make mental health assessments faster, more accurate and objective. Nick’s current research interests include how to translate speech processing into clinical research and practice. He is also fascinated by the application of speech processing and machine learning techniques to improve our understanding of different health conditions. He is particularly interested in applying these techniques to mental health disorders. Nick has (co-)authored over 200 conference and journal papers leading to over 6700 citations (h-index: 43). He is a frequent reviewer for IEEE, ACM and ISCA journals and conferences, serves on program and organizational committees, and is an associate editor of the journal Computer Speech and Language. He has previously organized tutorials (2015, 2017) and special sessions (2016, 2022, 2023, 2024) at Interspeech, gave a survey talk on speech-based health assessments at Interspeech 2022, and presented an invited talk on this topic at ASRU 2023. | |
Dr. Nina R Benway University of Maryland-College Park, USA | Nina R Benway, PhD CCC-SLP is a postdoctoral fellow in Electrical and Computer Engineering at the University of Maryland-College Park whose research focuses on clinical speech analytics for childhood speech sound disorders. Nina’s expertise spans several stages of clinical speech technology research and development. She is the curator of the publicly-available PERCEPT Corpora, which contain > 125,000 utterances from 473 children with and without speech sound disorders. She develops neurocomputationally inspired feature sets and machine learning algorithms to automate clinician judgment during speech therapy sessions. She is also involved in two separate (US) National Institutes of Health-funded clinical trials testing how well automated speech therapy works for children with speech sound disorders. |
Morning Tutorial (AM): 9:00 – 12:15
Hall: Syndicate 2.1
Description
Human interaction is inherently multimodal, utilizing visual and auditory cues to navigate
conversations and surroundings. Thus, human-machine interfaces (HMIs), such as those found
in conversational AI systems, predominantly emphasize visual feedback based on textual input.
However, this approach conspicuously neglects the needs of the blind and visually impaired.
This tutorial addresses the scarcity of resources detailing the deployment of conversational AI,
particularly in real-time embedded assistive technology contexts. Challenges in designing
the appropriate HMI to enhance accessibility, with a focus on optimizing response latency
through hardware and software, are delineated. Through theoretical and practical insights,
including live demonstrations and code snippets, this tutorial endeavors to bridge gaps and
break barriers, fostering inclusivity in conversational AI development with an emphasis on
real-time embedded deployment.
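As a flavor of the latency-focused code snippets, the sketch below profiles per-stage wall-clock latency of a toy voice-assistant pipeline. The ASR, dialogue, and TTS functions are hypothetical placeholders standing in for the real on-device components.

```python
# A minimal latency-profiling sketch; the three stage functions are placeholders
# for the actual on-device ASR, response-generation, and TTS components.
import time

def profile(stage_name, fn, *args):
    """Run one pipeline stage and print its wall-clock latency in milliseconds."""
    t0 = time.perf_counter()
    out = fn(*args)
    print(f"{stage_name}: {(time.perf_counter() - t0) * 1000:.1f} ms")
    return out

def run_asr(audio):          # placeholder for on-device speech recognition
    time.sleep(0.05); return "what is in front of me"

def run_dialogue(text):      # placeholder for response generation
    time.sleep(0.12); return "A door, about two meters ahead."

def run_tts(text):           # placeholder for speech synthesis
    time.sleep(0.08); return b"\x00" * 16000

audio = b"\x00" * 32000      # stand-in for a captured utterance
text = profile("ASR", run_asr, audio)
reply = profile("dialogue", run_dialogue, text)
speech = profile("TTS", run_tts, reply)
```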
Dr. Tal Rosenwein VP of Research at OrCam Technologies Ltd.; Adjunct Lecturer, Department of Computer Science, Tel-Aviv University | Tal Rosenwein serves as an adjunct lecturer in the Department of Computer Science at Tel-Aviv University, focusing on audio processing using deep learning. Concurrently, he holds the position of Vice President of Research at OrCam, a cutting-edge company dedicated to developing wearable, smart assistive devices tailored for individuals with visual and hearing impairments, as well as those with reading differences. He is also one of the organizers of iSpeech, the Israeli seminar on audio technologies. Tal obtained his Master’s degree in digital signal processing and machine learning at Ben Gurion University in Israel. He has made significant contributions to the field of automatic speech recognition (ASR) and speech enhancement, reflected in the dozens of patents he has authored in these domains and the several papers he has published on audio processing and automatic speech recognition. |
CANCELLED
Morning Tutorial
Description
This half-day tutorial will focus on “speech breathing” as a lung function and mental health biomarker. Speech breathing refers to how expired air and respiratory mechanics are utilised to produce the airflow necessary for phonation. The tutorial will be delivered by seven researchers from psychology, respiratory medicine, and AI.
The tutorial offers a comprehensive training program for two speech and voice research tools. ARIA-TRE is an open-access voice data collection tool whose design and purpose researchers can tailor to generate speech and voice collection programs. VoxLab is an AI-based automatic acoustic and phonetic analysis platform.
· Learning outcome 1: Using ARIA-TRE to design speech and voice studies.
· Learning outcome 2: Running VoxLab for acoustic and phonetic analysis.
· Learning outcome 3: Applying the two tools to lung function and other healthcare applications.
The tutorial also covers related activities and issues, e.g. open science in voice data, speech breathing in healthcare, AI algorithms in diagnosis, and the future of speech-breathing technology.
Presenter Information
Dr. Biao Zeng Department of Psychology, University of South Wales, UK |
Dr. Biao Zeng is a psychology lecturer, course leader, cognitive psychologist, and applied neuroscientist. He has successfully organised two tutorials at the British HCI conference (HCI, Digital Health and Business Opportunities in China and UK, Bournemouth, 2016; Data, Virtual Interaction and Digital Health between China and UK, Belfast, 2018). In recent years, his work has focused on designing and validating a speech-breathing task (the “helicopter task”) embedded in ARIA-TRE. In 2023, the Welsh government funded him to set up the Wales-wide speech-breathing research network. |
Dr. Xiaoyu Zhou The First Clinical College, Bengbu Medical University, Anhui, China |
Dr. Xiaoyu Zhou is an Associate Chief Physician and Associate Professor with 17 years of experience in respiratory diseases. She specialises in managing chronic respiratory conditions and pulmonary rehabilitation therapy. In recent years, Dr. Zhou has undertaken five research projects, resulting in two invention patents and nine utility model patents. In 2023, Dr. Zhou received funding from the Anhui Provincial Department of Education for an outstanding talent cultivation project in higher education and was invited by the University of South Wales, United Kingdom, for a year-long academic visit. Under Dr. Biao Zeng’s supervision, she conducted a series of studies to develop a speech-breathing test. |
Dr. Tom Powell Cwm Taf Morgannwg NHS University Health Board, Wales Institute of Digital Information, University of South Wales, UK |
Dr Tom Powell is the Head of Innovation for Cwm Taf Morgannwg University Health Board and Regional Innovation Lead for the Cwm Taf Morgannwg Regional Partnership Board. He is also a Digital Innovation Ambassador for the Life Science Hub Wales. He is now on part-time secondment to the University of South Wales as a Visiting Fellow and Associate Director of the Wales Institute of Digital Information. Tom’s role often involves bringing together like-minded colleagues to create collaborative multi-organisation innovation to deliver new approaches and ideas. These include the ORIEL and ARIA–TRE initiatives with Respiratory Innovation Wales. With over 20 years of experience, he has worked within the NHS, academia, clinical research and the wider public sector, completing and evaluating projects, research trials, and organisational change initiatives. With a doctorate in Respiratory Physiology, he holds many grants and is the CI on several research projects. |
Mr. Robert Salter Institute of Engineering and Technology (MIET), Chartered Management Institute (MCMI), British Computer Society (MBCS) |
Mr. Robert Salter is a Consultant Innovation Scientist with over 30 years of experience delivering digital transformation and medical device integration. He has had a blended career working within the NHS, academia, and the corporate sector, as well as managing SME companies, and has a worldwide reputation for contributions to healthcare informatics and system integration with physiological data in both EMEA and the US. He is actively involved in research projects, supervises students, lectures in Digital Transformation and Cyber Security, and has several patent applications for medical devices. He is committed to the ethical use of data and value-sensitive design to ensure data is used in an accountable and legal manner. | |
Dr. Tim Bashford Wales Institute of Digital Information, University of Wales Trinity Saint David, UK |
Dr. Tim Bashford is the research lead and deputy director for the Wales Institute of Digital Information (WIDI), a tripartite collaboration between the University of Wales Trinity Saint David, the University of South Wales, and Digital Health and Care Wales (DHCW), focused on digital healthcare and social innovation. Tim researches across the fields of artificial intelligence, machine learning, computational physics, computational biomedicine and digital healthcare. He has particular areas of interest in the simulation of light-tissue interaction, gamification of public health, detection and classification of respiratory disease through recorded voice and the impact of generative AI on academic integrity. Tim has a PhD in computer science and numerous academic publications. |
Mr. Mark Huntley Wales Institute of Digital Information, University of Wales Trinity Saint David, UK |
Mark has held the position of researcher at the Wales Institute of Digital Information (WIDI) for three years. He holds a BSc in web development and an MSc in Software Engineering and AI. He focuses on AI and ML operations, with a keen interest in MLOps and AI and ML in the cloud. He lectures at the UWTSD School of Applied Computing on several subjects, including ML application, data analysis, and virtualisation. |
Mr. Nathan Morgan Wales Institute of Digital Information, University of Wales Trinity Saint David, UK |
Nathan has held the positions of researcher at the Wales Institute of Digital Information (WIDI) and lecturer at the University of Wales Trinity Saint David for three years. During this time, Nathan specialised in Software Engineering and Cloud computing while maintaining an interest in signal processing, audio, and artificial intelligence. |
Afternoon Tutorial (PM): 13:45 – 16:45
Hall: Acesso
Description
Recent advances in representation learning make it possible to build spoken language processing applications on top of automatically discovered representations or units without relying on any textual resources or automatic speech recognition tools. These new developments represent a unique opportunity to redefine the entire field of speech and language processing, opening up the development of applications for under-resourced and unwritten languages while incorporating the richness and expressivity of oral language. This tutorial will therefore discuss the foundations and recent advancements of this new sub-field, which we refer to as Speech Language Modeling. We will present the components that make up the full speech language modeling pipeline, describe some of their applications, and discuss how such a complex pipeline should be evaluated. We will additionally discuss future research directions as well as how progress should be measured in this new field of research.
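For orientation, the sketch below shows the unit-discovery step that underlies much of this pipeline: frame-level self-supervised features are clustered with k-means and each frame is replaced by its cluster index, yielding pseudo-text units for a language model. The features here are random placeholders standing in for, e.g., HuBERT activations.

```python
# A minimal unit-discovery sketch on placeholder features; in practice the
# features would come from a self-supervised speech encoder such as HuBERT.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.standard_normal((5000, 768))   # (frames, feature_dim) placeholder SSL features
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(features)

utterance = rng.standard_normal((250, 768))   # frame features of one utterance
units = kmeans.predict(utterance)             # one discrete unit per frame
# Collapse consecutive repeats, as is common before unit language modeling.
deduped = [int(u) for i, u in enumerate(units) if i == 0 or u != units[i - 1]]
print(deduped[:20])
```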
Dr. Yossi Adi The Hebrew University of Jerusalem, Israel | Yossi Adi is an Assistant Professor at the Hebrew University of Jerusalem, in the School of Computer Science and Engineering. Prior to his current position, Yossi was a staff research scientist at Meta’s Fundamental AI Research (FAIR) team. Yossi holds a Ph.D. in computer science from Bar-Ilan University and has received several prestigious awards, including the IAAI Best Doctoral Dissertation Award (2020) and the Alon scholarship (2023). Yossi’s research spans core machine learning and deep learning algorithms with a specific emphasis on their application to spoken language modeling. Yossi has published various papers on speech language models at top-tier machine-learning, natural language processing, and speech conferences and journals, including Lakhotia et al. (2021), Kharitonov et al. (2021), Polyak et al. (2021), Kreuk et al. (2022), Sicherman and Adi (2023), and Hassid et al. (2024). He has served on several technical committees, including the IEEE Machine Learning for Signal Processing Technical Committee (MLSP) and the Workshop on Machine Learning in Speech and Language Processing (MLSLP), has released an open-source package for spoken language processing (textless-lib), and is co-organizing two special sessions on the topic at Interspeech 2024. | |
Dr. Soumi Maiti Carnegie Mellon University, USA | Soumi Maiti is a post-doctoral researcher at the Language Technologies Institute at Carnegie Mellon University. Her research focuses on the application of machine learning in speech processing systems, with a particular interest in understanding human perception of speech and languages. Soumi holds a Ph.D. in computer science from the City University of New York. She has held previous positions at Apple, Google, and Interactions LLC in various capacities. Her recent research includes speech language models (Maiti et al., 2023) and establishing the evaluation of speech generative models (Maiti et al., 2023; Saeki et al., 2024). She previously took part in the ICASSP 2022 education short course Inclusive Neural Speech Synthesis. | |
Prof. Shinji Watanabe Carnegie Mellon University, USA | Shinji Watanabe is an Associate Professor at Carnegie Mellon University. His research interests include automatic speech recognition, speech enhancement, spoken language understanding, and machine learning for speech and language processing. He has published over 400 papers in peer-reviewed journals and conferences and received several awards, including the best paper award from IEEE ASRU in 2019. His recent studies include speech foundation models, e.g., self-supervised learning (Mohamed et al., 2022), reproduction of OpenAI Whisper models (Peng et al., 2023), and speech language models (Maiti et al., 2023; Huang et al., 2023). He is also interested in establishing the evaluation of speech generative models based on speech language models (Maiti et al., 2023; Saeki et al., 2024). He has extensive experience conducting tutorials, e.g., at ICASSP 2012, 2021, and 2022 and at Interspeech 2016, 2019, 2022, and 2023, including “Self-supervised Representation Learning for Speech Processing” at Interspeech 2022, which is related to this tutorial. He is an IEEE and ISCA Fellow.
Afternoon Tutorial (PM): 13:45 – 16:45
Hall: Melambus
Description
Individuals with severe speech and motor impairment (SSMI) lose their ability to produce audible, intelligible speech. Recent progress in intracortical brain-computer interfaces (iBCIs) offers a promising path to restoring naturalistic communication. These iBCIs decode attempted speech using single-neuron resolution neural activity from the areas of the cortex that generate orofacial movement and speech. Such iBCI speech neuroprostheses will have a significant impact on the SSMI population’s quality of life and agency. In the tutorial’s presentation component, we will cover the foundations of SSMI speech production and showcase results from two BrainGate2 clinical trial participants who have chronically implanted Utah arrays in speech-related cortical areas. The tutorial will discuss various techniques for decoding attempted speech from intracortical signals that differ in their decoded output target: speech articulators, phonemes, words, sentences, or audible speech. In the interactive Jupyter-style notebook component, we will cover intracortical data pre-processing and decoding techniques.
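As a preview of the notebook component, the sketch below performs a typical pre-processing step on synthetic intracortical data: binning per-channel spike times into smoothed, normalized firing rates that a phoneme or word decoder could consume. All signals here are placeholders, not BrainGate2 data.

```python
# A minimal pre-processing sketch on synthetic spike trains; real pipelines
# would start from threshold crossings recorded on a Utah array.
import numpy as np

rng = np.random.default_rng(0)
n_channels, duration_s, bin_s = 96, 2.0, 0.02           # one 96-channel array, 20 ms bins
spike_times = [np.sort(rng.uniform(0, duration_s, rng.poisson(40)))
               for _ in range(n_channels)]               # per-channel spike times (s)

edges = np.arange(0.0, duration_s + bin_s, bin_s)
counts = np.stack([np.histogram(st, bins=edges)[0] for st in spike_times])  # (channels, bins)

# Smooth with a short boxcar, convert to rates, then z-score per channel.
kernel = np.ones(5) / 5.0
rates = np.stack([np.convolve(c, kernel, mode="same") for c in counts]) / bin_s
rates = (rates - rates.mean(axis=1, keepdims=True)) / (rates.std(axis=1, keepdims=True) + 1e-8)
print(rates.shape)   # features ready for a phoneme/word decoder (e.g. an RNN)
```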
Dr. Stephanie Haro Brown School of Engineering, member of the BrainGate consortium, USA | Stephanie Haro is a Postdoctoral Research Associate at the Brown School of Engineering and a member of the BrainGate consortium. Steph received a BS in Electrical Engineering from Brown University (2017) and a PhD in Speech and Hearing Bioscience and Technology from Harvard (2023). She completed her dissertation at MIT Lincoln Laboratory under the supervision of Thomas Quatieri and Christopher Smalt. Her prior work has leveraged computational modeling and non-invasive cortical recordings to study auditory attention decoding and hearing-impaired speech perception in multi-talker environments (Google Scholar). She gathered relevant tutorial experience through her work with the online Cognition and Natural Sensory Processing (CNSP) workshop and the MIT Introduction to Technology, Engineering, and Science portfolio of programs. | |
Mr. Chaofei Fan Stanford Neural Prosthetics Translational Lab, member of the BrainGate consortium | Chaofei Fan is a PhD student at the Stanford Neural Prosthetics Translational Lab and a member of the BrainGate consortium. His research focuses on leveraging AI to develop high-performance and reliable neuroprostheses for restoring communication abilities in individuals with speech impairment (Google Scholar). He received a BS in Computer Science from Zhejiang University and Simon Fraser University (2011) and an MS in Computer Science from Stanford University (2023). Prior to pursuing his PhD, he developed ASR and NLP systems at Mobvoi, a leading AI startup in China. | |
Dr. Maitreyee Wairagkar Postdoctoral Scholar, University of California Davis, Neuroprosthetics Lab, member of the BrainGate consortium, USA | Maitreyee Wairagkar is a Postdoctoral Scholar in the Neuroprosthetics Lab at the University of California, Davis, USA, and a member of the BrainGate consortium. Her research focuses on developing neuroprostheses and assistive neurotechnology using artificial intelligence to restore lost speech and movement in individuals with severe neurological conditions via implanted as well as non-invasive brain-computer interfaces. She earned her MEng (2014) and PhD (2019) degrees in AI and Cybernetics from the University of Reading, UK, where she was also a teaching assistant for 7 years. Previously, she was a Postdoctoral Research Associate at Imperial College London, UK, working on conversational AI and affective robotics for dementia care. Her publications are available on Google Scholar, and her work and teaching experience can be found on her website.
Afternoon Tutorial (PM): 13:45 – 16:45
Hall: Syndicate 2.1
Description
Optimal transport (OT) has garnered widespread attention in numerous research tasks involving the comparison or manipulation of probability distributions over underlying features or objects. Its appeal lies in its capacity to provide an effective metric, such as a discrepancy or distance measure, between two probability distributions. It facilitates the establishment of correspondences between sets of samples by incorporating a geometry-aware distance between distributions. Because speech is inherently a temporal sequence, integrating OT into speech processing remains a challenge, and it is still under-explored in the domain. The purpose of this tutorial is to promote the adoption of OT in speech processing. We will introduce the fundamental mathematical principles of OT and present computational algorithms, with a specific emphasis on those integrated into deep learning model frameworks. Furthermore, our objective is to share our discoveries and insights with researchers who are interested in exploring this promising field.
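For a concrete feel of the algorithms we will present, the sketch below computes an entropy-regularized OT plan between the frame-level features of two utterances via Sinkhorn iterations. The features are random placeholders, and this plain-NumPy implementation is only a minimal illustration of the computation.

```python
# A minimal Sinkhorn sketch: entropy-regularized OT between two sets of
# placeholder speech feature vectors with uniform marginals.
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iters=200):
    """Return the transport plan and regularized OT cost for cost matrix C."""
    K = np.exp(-C / eps)                       # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                      # alternate scaling updates
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]            # transport plan
    return P, float((P * C).sum())

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 64))              # frames of utterance A (placeholder features)
Y = rng.standard_normal((40, 64))              # frames of utterance B
C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared Euclidean cost
a = np.full(len(X), 1 / len(X))                # uniform marginal over A's frames
b = np.full(len(Y), 1 / len(Y))                # uniform marginal over B's frames
P, cost = sinkhorn(a, b, C / C.max())          # normalize cost for numerical stability
print(P.shape, cost)
```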
Dr. Xugang Lu Advanced Speech Technology Laboratory, National Institute of Information and Communications Technology (NICT), Japan | Dr. Xugang Lu (Member, IEEE) earned his B.S. and M.S. degrees from Harbin Institute of Technology (HIT), China, in 1994 and 1996, respectively. He completed his Ph.D. at the National Lab of Pattern Recognition, Chinese Academy of Sciences, in 1999. Starting in October 1999, he worked as a research fellow at Nanyang Technological University, Singapore. In December 2001, he became a postdoctoral fellow at McMaster University, Canada. Between April 2003 and April 2008, he served as an assistant professor at the School of Information Science at the Japan Advanced Institute of Science and Technology (JAIST). After May 2008, he joined the Advanced Telecommunications Research Institute (ATR) Spoken Language Communication Research Labs and later moved to the National Institute of Information and Communications Technology (NICT), Japan, as a senior researcher. From 2017 to 2023, he held a guest professor position at Doshisha University, Japan. His primary research interests include speech communications and statistical machine learning. | |
Dr. Yu Tsao Research Center for Information Technology Innovation (CITI), Academia Sinica, Taiwan | Dr. Yu Tsao (Senior Member, IEEE) obtained his B.S. and M.S. degrees in electrical engineering from National Taiwan University, Taiwan, in 1999 and 2001, respectively. He earned his Ph.D. degree in electrical and computer engineering from the Georgia Institute of Technology, USA, in 2008. From 2009 to 2011, he served as a Researcher at the National Institute of Information and Communications Technology (NICT), Japan. Currently, he holds the position of research fellow (professor) and deputy director at the Research Center for Information Technology Innovation, Academia Sinica, Taiwan. Additionally, he is a jointly appointed professor with the Department of Electrical Engineering at Chung Yuan Christian University, Taiwan. Dr. Tsao’s research interests encompass assistive oral communication technologies, audio coding, and bio-signal processing. He serves as an Associate Editor for the IEEE/ACM Transactions on Audio, Speech, and Language Processing and IEEE Signal Processing Letters. Dr. Tsao received the Academia Sinica Career Development Award in 2017, national innovation awards from 2018 to 2021, the Future Tech Breakthrough Award in 2019, and the Outstanding Elite Award from the Chung Hwa Rotary Educational Foundation in 2019–2020. Moreover, he is the corresponding author of a paper that received the 2021 IEEE Signal Processing Society (SPS) Young Author Best Paper Award. |
Tutorial times:
Morning Tutorials (AM): 9:00 – 12:15
Afternoon Tutorials (PM): 13:45 – 16:45
Full-day Tutorials (AM+PM): 9:00 – 17:00
Tutorials will take place on Sunday, 1 September at the KICC in Kos.
Tutorial Rates
Student / Retiree
- AM/PM One Tutorial – € 145
- AM/PM Two Tutorials – € 215
- Full day Tutorial – € 195
Regular
- AM/PM One Tutorial – € 250
- AM/PM Two Tutorials – € 410
- Full day Tutorial – € 390